Search for: All records
Creators/Authors contains: "Erez, Mattan"


  1. The microservice architecture is increasingly popular for flexible, large-scale online applications. However, existing resource management mechanisms incur high latency in detecting Quality of Service (QoS) violations and hence fail to allocate resources effectively under commonly observed varying load conditions. The result is over-allocation coupled with a late response, which increases both the total cost of ownership and the magnitude of each QoS violation. We present SurgeGuard, a decentralized resource controller for microservice applications specifically designed to guard application QoS during surges in load and network latency. SurgeGuard uses the key insight that, for rapid detection and effective management of QoS violations, the controller must be aware of any available latency slack and of the communication patterns between microservices within a task graph. Our experiments show that, for the workloads in DeathStarBench, SurgeGuard on average reduces the combined violation magnitude and duration by 61.1% and 93.7%, respectively, compared to the well-known Parties and Caladan algorithms, and requires 8% fewer resources than Parties.
    Free, publicly-accessible full text available November 17, 2025
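    The abstract's core mechanism is tracking latency slack across the microservice task graph. Below is a minimal sketch of that idea, assuming a hypothetical graph, latencies, and guard-band threshold; it is not the actual SurgeGuard controller.

```python
# A minimal sketch of the slack-based detection idea from the SurgeGuard
# abstract. The graph, latencies, threshold, and function names are all
# illustrative assumptions, not the actual SurgeGuard implementation.

# Microservice task graph: edges point to downstream (callee) services.
GRAPH = {
    "frontend":  ["search", "recommend"],
    "search":    ["storage"],
    "recommend": ["storage"],
    "storage":   [],
}
MEASURED_MS = {"frontend": 4.0, "search": 20.0, "recommend": 9.0, "storage": 6.0}
QOS_BUDGET_MS = 30.0  # end-to-end latency target for a request


def critical_path_ms(node):
    """Longest measured latency from `node` down to any leaf."""
    children = [critical_path_ms(c) for c in GRAPH[node]]
    return MEASURED_MS[node] + (max(children) if children else 0.0)


def check_slack(root="frontend", guard_band=0.1):
    """Flag the critical path for scale-up when remaining slack is too small.

    A decentralized controller would make this decision locally per service;
    the sketch computes it globally for brevity.
    """
    slack = QOS_BUDGET_MS - critical_path_ms(root)
    flagged = []
    if slack < guard_band * QOS_BUDGET_MS:
        node = root
        while True:
            flagged.append(node)
            if not GRAPH[node]:
                break
            # Follow the slowest child: the critical path gets resources first.
            node = max(GRAPH[node], key=critical_path_ms)
    return slack, flagged


print(check_slack())  # -> (0.0, ['frontend', 'search', 'storage'])
```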
  2. High load latency, resulting from deep cache hierarchies and relatively slow main memory, is an important limiter of single-thread performance. Data prefetching helps hide this latency by fetching data up the hierarchy before load instructions request it. However, prefetching has been shown to be imperfect in many situations. We propose cache-level prediction to complement prefetchers. Our method predicts which memory hierarchy level a load will access, allowing the access to start at that level earlier and thereby saving many cycles. The predictor provides high prediction accuracy at the cost of just one cycle of added latency on L1 misses. Level prediction reduces memory access latency by 20% on average and provides a speedup of 10.3% over a conventional baseline and 6.1% over a boosted baseline on generic, graph, and HPC applications.
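    As an illustration of the cache-level prediction idea, the sketch below models a small PC-indexed predictor table. The table size, indexing, and last-outcome update policy are assumptions for the sketch, not the paper's exact design.

```python
# Illustrative model of a PC-indexed cache-level predictor in the spirit of
# the abstract above; structure and policy are assumptions, not the paper's.

LEVELS = ("L1", "L2", "L3", "DRAM")

class LevelPredictor:
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = ["L1"] * entries  # last observed hit level per load PC

    def _index(self, pc):
        return (pc >> 2) % self.entries  # drop alignment bits, then index

    def predict(self, pc):
        # Consulted at issue time: the load can be launched directly at the
        # predicted level instead of walking the hierarchy level by level.
        return self.table[self._index(pc)]

    def update(self, pc, actual_level):
        # Trained when the load resolves; a mispredict simply falls back to
        # the normal hierarchy walk.
        assert actual_level in LEVELS
        self.table[self._index(pc)] = actual_level

p = LevelPredictor()
p.update(0x4008, "L3")    # this load PC has been hitting in L3
print(p.predict(0x4008))  # -> "L3": start the access at L3 early
```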
  3. Python's ease of use and rich collection of numeric libraries make it an excellent choice for rapidly developing scientific applications. However, composing these libraries to take advantage of complex heterogeneous nodes is still difficult. To simplify writing multi-device code, we created Parla, a heterogeneous task-based programming framework that fully supports Python's scientific programming stack. Parla's API is based on Python decorators and allows users to wrap code in Parla tasks for parallel execution. Parla arrays enable automatic movement of data between devices. The Parla runtime handles resource-aware mapping, scheduling, and execution of tasks. Compared to other Python tasking systems, Parla is unique in its parallelization of tasks within a single process, its GPU context- and resource-aware runtime, and its design around gradual adoption, which eases migration of, and integration into, existing Python applications. We show that Parla can achieve performance competitive with hand-optimized code while improving ease of development.
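    The decorator-based API described above can be sketched as follows. The module and parameter names follow Parla's published examples but may differ across versions; treat this as a schematic rather than exact API usage.

```python
# Schematic example of the decorator-based tasking style the abstract
# describes. Names are taken from Parla's public examples and may vary
# across versions; this is a sketch, not authoritative API documentation.
from parla import Parla, spawn, TaskSpace

def main():
    T = TaskSpace("T")

    @spawn(T[0])
    def produce():
        print("task 0: build input data")

    # The dependency list makes T[1] wait for T[0]; both tasks run in
    # parallel within a single process, per the abstract.
    @spawn(T[1], dependencies=[T[0]])
    def consume():
        print("task 1: consume the data")

if __name__ == "__main__":
    with Parla():  # start the Parla runtime, then spawn tasks inside it
        main()
```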
  4. Today, larger memory capacity and higher memory bandwidth are required for better performance and energy efficiency in many important client and datacenter applications. Hardware memory compression is a promising way to achieve this without increasing system cost. Unfortunately, current memory compression solutions face two significant challenges. First, keeping memory compressed requires additional memory accesses, sometimes on the critical path, which can cause performance overheads. Second, they require changes to the operating system to take advantage of the increased capacity and to handle incompressible data, which delays deployment. We propose Compresso, a hardware memory compression architecture that minimizes the memory overheads of compression and requires no OS changes. We identify new data-movement trade-offs and propose optimizations that reduce additional memory movement to improve system efficiency. We also propose a holistic methodology for evaluating compressed systems. Our results show that Compresso achieves an average main-memory compression ratio of 1.85x, with a 24% speedup over a competitive hardware-compressed system for single-core systems and 27% for multi-core systems. Compared with competitive compressed systems, Compresso not only reduces the performance overhead of compression but also increases the performance gain from higher memory capacity.
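    To make the compression side concrete, here is a toy base-plus-delta line compressor of the kind hardware memory compression schemes build on. It is a generic textbook technique, not necessarily Compresso's actual compression algorithm.

```python
# Toy base+delta compressor for one 64-byte line (eight 8-byte words). This
# illustrates the kind of scheme hardware memory compression builds on; it
# is a generic technique, not necessarily Compresso's algorithm.

def compress_line(words, delta_bits=16):
    """Return (base, deltas) if all words fit as narrow deltas, else None."""
    base = words[0]
    limit = 1 << (delta_bits - 1)
    deltas = [w - base for w in words]
    if all(-limit <= d < limit for d in deltas):
        # 8-byte base + eight 2-byte deltas = 24 bytes instead of 64.
        return base, deltas
    return None  # incompressible: the line is stored uncompressed

def decompress_line(base, deltas):
    return [base + d for d in deltas]

line = [0x1000, 0x1008, 0x1010, 0x1018, 0x1020, 0x1028, 0x1030, 0x1038]
packed = compress_line(line)
assert packed is not None and decompress_line(*packed) == line
```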